Abstract


Face masks affect the transmission of speech and obscure facial cues. Here, we examine how this reduction in acous- 
tic and facial information affects a listener's understanding of speech prosody. English sentence pairs that differed in 
their intonational (statement/question) and emotional (happy/sad) prosody were created. These pairs were recorded 
by a masked and unmasked speaker and manipulated to contain audio or not. This resulted in a continuum from 
typical unmasked speech with audio (easiest) to masked speech without audio (hardest). English listeners (N = 128)
were tested on their discrimination of these statement/question and happy/sad pairs. We also collected six individual 


difference measures previously reported to affect various linguistic processes: Autism Spectrum Quotient, musical 
background, phonological short-term memory (digit span, 2-back), and congruence task (flanker, Simon) behavior. 
The results indicated that masked statement/question and happy/sad prosodies were harder to discriminate than 
unmasked prosodies. Masks can therefore make it more difficult to understand a speaker's intended intonation or 
emotion. Importantly, listeners differed considerably in their ability to understand prosody. When wearing a mask, 
speakers should try to speak more clearly and loudly, if possible, and make intentions and emotions explicit to the listener.

Significance statement

For surgeons and painters, communication in face masks 
is common. For others, COVID-19 marked the begin- 
ning of talking (speech production) and listening (speech 
perception) while wearing a mask. Masks can affect the 
transmission of the speech signal and obscure facial cues. 
This change in listening conditions has affected people 
differently. What are some of the factors that cause this 
individual variability in listeners? This study explored 
that question in terms of speech prosody. The utterance
“it’s raining” can be a statement (flat intonation) or
a question (rising intonation). Prosody is often accompanied
by facial cues, such as head tilts and eyebrow
raises. Masks can muffle speech cues and hide facial cues, 
which can make prosody difficult to understand. Our 
study found that masks make it harder to understand a 
speaker’s statement/question intonational prosody and 
happy/sad emotional prosody. Among the individual 
differences we tested, we found that Autism Spectrum 
Quotient predicted some performance on the prosody 
discrimination task. The findings have potential edu- 
cational and clinical implications. When speaking with 
a mask, speakers should increase pitch and volume, if 
possible. Because facial cues may be obscured, speakers 
should also be more explicit about their intended emotions/questions
(e.g., “I’m happy it’s raining” or “I have a
question: is it raining?”).

Introduction

To fight the spread of the COVID-19 virus, facial mask 
mandates were put in place by governments throughout 
the world. For many people, this was the first time both 
the speaker and listener wore masks during communi- 
cation. Masks have acoustic and visual consequences. 
Acoustically, the materials made to reduce the transmis- 
sion of pathogens also reduce sound transmission (Magee 
et al., 2020). As a result, masks can reduce a speaker’s 
fundamental frequency (F0: what listeners perceive as
pitch) and amplitude (what listeners perceive as volume 
or loudness). For many listeners, this reduction in acous- 
tic information makes understanding speech more diffi- 
cult (e.g., Brown et al., 2021; Fiorella et al., 2021; Mheidly 
et al., 2020). Visually, a mask obscures the mouth and 
hides facial cues. Visual information like mouth move- 
ments can help a listener better understand acoustic 
information (e.g., Best, 1995; Fowler, 1986; Saunders 
et al., 2021). For example, the relatively similar-sounding
English speech sounds /s/ and /ʃ/ differ in their lip
rounding, which listeners can use to better understand
whether the speaker needs to sip the bottle or ship the 
bottle. For those listeners with hearing problems, com- 
municating in noisy environments, and listening to non- 
native speech, visual cues can be very helpful (Fiorella 
et al., 2021; House et al., 2001; Sueyoshi & Hardison, 
2005; Winn et al., 2013). 

In the present study, we extend recent research into 
masks and speech perception by examining the percep- 
tion of speech prosody and masks. Prosody is a broad 
term that includes pitch, stress, rhythm, and intonation 
(e.g., Cutler, 2012; Cutler et al., 1997). It is often described 
as not what a speaker says, but how it is said. For exam- 
ple, a student telling a friend, “Class is cancelled” could 
convey happiness because it is a boring class or sadness 
because it is the student’s favorite class. Acoustic cues 
like F0 and amplitude (among others) change given the
prosody of the speech. Here, we examine intonational 
statement/question prosodies and emotional happy/sad 
prosodies produced with and without masks. Statements 
are usually characterized by their relatively falling volume 
and pitch, whereas questions are usually characterized 
by their relatively rising volume and pitch (Gussenhoven 
& Chen, 2000; Pell, 2001; Srinivasan & Massaro, 2003). 
Happy speech is typically characterized by its relatively 
high volume and high pitch; in contrast, sad speech is 
typically characterized by its relatively low volume and 
low pitch (Bänziger & Scherer, 2005; Scherer, 2003; Sobin
& Alpert, 1999). 

Smiling, frowning, and raising and lowering the head 
regularly accompany speech prosody (Graf et al., 2002; 
Granström & House, 2005; Granström et al., 1999). Listeners
can use these facial cues to better understand the
speaker’s intent and emotions (Lansing & McConkie,
1999; Munhall et al., 2004). Masks can obscure some of 
these cues, which makes detecting a speaker’s emotions 
more difficult (Carbon, 2020). Yet, adults differ consid- 
erably in their ability to interpret these visual cues (e.g., 
Gandour et al., 2003; Lambrecht et al., 2012; Rymarczyk 
& Grabowska, 2007). This individual variation in prosody 
perception (e.g., Cole, 2015; Ward, 2019) is the focus of 
our study. Here, we examine how listeners differ in their 
perception of intonational (statement/question) and 
emotional (happy/sad) prosody in which the target sen- 
tences contain identical words, but contrast in their per- 
ceived loudness and pitch, as well as subtle facial cues. 

Given that COVID-19 made laboratory-based data 
collection difficult, we collected data on behavioral dif- 
ferences using short, reliable tasks easily administered 
via the internet. This extends previous laboratory-based 
research on the individual differences in listeners’ per- 
ception of prosody (e.g., Ferreira & Karimi, 2015; Jun & 
Bishop, 2015) by testing a larger, more diverse sample
beyond the laboratory. We manipulated the presence
of masks and audio to create a continuum from typical 
unmasked speech with audio (easiest), to masked speech 
without audio (hardest). We also examined four measures 
previously reported to affect language processes: Autism 
Spectrum Quotient, musical background, phonological 
short-term memory, and congruence task behavior. 

Individuals diagnosed with autism spectrum disorder 
tend to struggle with the perception of prosody, rec- 
ognition of emotions, and overcoming face masks (see 
McCann & Peppé, 2003 for a review). In particular, adult 
listeners diagnosed with autism spectrum disorder strug- 
gle to recognize the facial and acoustic cues associated 
with emotions like happy and sad (e.g., Clark et al., 2008; 
Peppé et al., 2011; Philip et al., 2010). We predict those 
listeners with greater “autistic” traits will be less accu- 
rate at discriminating prosody than those listeners with 
fewer “autistic” traits. This difference may be particularly 
noticeable in the happy/sad prosody given that facial cues 
typically accompany this speech, and in masked speech 
given that facial cues are further reduced. 

Music and language are believed to be processed in 
similar parts of the brain given their shared use of voice 
and rhythm (Patel, 2010). Musicians typically outper- 
form non-musicians in a wide range of prosody percep- 
tion tasks involving rhythm, stress, tone, and emotion 
(e.g., Hausen et al., 2013; Lima & Castro, 2011; Thomp- 
son et al., 2004). This is typically attributed to musicians’ 
greater sensitivity to pitch (F0 cues) and volume (amplitude
cues) as a result of their training. We predict those
listeners with greater musical training will more accu- 
rately discriminate happy/sad and statement/question 
prosody than those listeners with less or no training.
Musicians may also show an advantage over non-musicians
in masked speech given their greater sensitivity to
F0 and amplitude cues.

Phonological short-term memory affects a wide range 
of linguistic processes, including prosody recall, discrimi- 
nation, and categorization (e.g., Baddeley et al., 1984; Jac- 
quemot & Scott, 2006; Lambrecht et al., 2012; Stepanov 
et al., 2020). Because accurate perception of prosody 
requires not only accessing the meaning of the words 
(phonological sound-to-meaning mapping), but also rec- 
ognizing how that meaning may change given variations 
in the acoustics heard at a later point in time (see Cutler, 
2012 for a review), a listener with greater phonological 
short-term memory may be able to better store mean- 
ing and acoustics than a listener with weaker phonologi- 
cal short-term memory. This may be especially helpful 
for overcoming masked speech, which typically damp- 
ens prosodic cues like pitch and volume. We predict that 
individuals with greater phonological short-term mem- 
ory will more accurately discriminate prosody than those 
with lesser phonological short-term memory. 

Finally, listeners differ in their ability to focus on the 
task at hand when distracted. A large body of psycho- 
metric literature has examined how the ability to resist 
distractor interference and inhibit pre-potent responses 
contributes to human behavior (often discussed as “cog- 
nitive inhibition,” see Lu & Proctor, 1995; Rey-Mermet, 
Gade, & Oberauer, 2018). Many congruency tasks in 
which the participant must ignore and suppress irrel- 
evant or incongruent information have been found to 
predict performance on a variety of linguistic tasks, espe- 
cially tasks involving switching between different lan- 
guages and linguistic units (e.g., Blumenfeld & Marian, 
2014; Pliatsikas & Luk, 2016). We note, however, that which
specific congruence task is used, and what linguistic construct
(if any) it predicts, varies considerably across the
literature (e.g., Hedge et al., 2018; Poarch & Van Hell, 
2012). Here, we carry out exploratory research to exam- 
ine whether behavior on two congruency tasks (one 
linguistic and one non-linguistic) can predict prosody 
discrimination performance. We predict that listeners 
who perform better on congruency tasks, that is, are bet- 
ter able to focus on the task despite incongruent infor- 
mation, will more accurately discriminate prosody than 
those who perform poorly on the congruency task and 
are unable to focus on the task given incongruent infor- 
mation. This difference may be particularly robust in 
masked speech, which can confuse the listener in terms 
of reading emotions (Carbon, 2020). 


Methods 
All stimuli, data, and R code are available on the Open
Science Framework: https://osf.io/gs79t/.


Participants 

The experiment was built and run using Gorilla (Anwyl- 
Irvine et al., 2020). A total of 165 participants were 
initially recruited from Prolific (www.prolific.co). All 
participants were 18 years of age or older (mean = 32.8)
with normal hearing. All participants self-identified as 
monolingual English speakers who had learned English 
from birth. Participants were required to use only a desk- 
top or laptop computer. Participants were asked to wear 
headphones and confirm that they would wear them dur- 
ing the entire experiment. The experiment took roughly 
30 min to complete. Participants were paid for their time 
($11/h). The experiment was approved by the Carnegie 
Mellon University Institutional Review Board. From the 
original 165 participants tested, 16 were removed for 
having hearing problems, 5 for failing attention checks, 5 
for data failure, and 11 for below chance performance in 
either the happy/sad or statement/question prosody task. 
This left data from 128 participants, which we report 
below. 


Questionnaires 

Participants completed two questionnaires: Autism 
Spectrum Quotient and Music Use. The Autism Spec- 
trum Quotient (Baron-Cohen et al., 2001) is a brief 
questionnaire containing 50 questions across five areas: 
social skill, attention switching, attention to detail, com- 
munication, and imagination. Each question allowed four 
choices (definitely agree, slightly agree, slightly disagree, 
definitely disagree) and therefore allowed for an estima- 
tion of autism spectrum traits. Following Baron-Cohen 
et al., each autistic-like behavior was scored as 1 (irre- 
spective of whether it was a “definitely” or “slightly” 
response) whereas each non-autistic-like behavior was 
scored as 0 (irrespective of whether it was a “definitely” 
or “slightly” response). This resulted in a total score for 
each participant ranging from 0 to 50 with higher scores 
reflecting greater autistic-like behavior; scores of 32 or 
greater represented what Baron-Cohen et al. call “a useful 
cutoff for distinguishing individuals who have clinically 
significant levels of autistic traits” (2001, 15). The internal 
consistency of the questions, as measured by Cronbach's 
alpha, was 0.87, or “good” (Cronbach, 1951). 
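
The binary scoring rule described above can be sketched in Python. This is an illustration only, not the authors' scoring script (their analysis code is in R). The function name `score_aq` and the `agree_keyed` argument are hypothetical, and the set of items for which agreement counts as the autistic-like answer must be taken from the published scoring key rather than from this example.

```python
# The four response options follow the questionnaire description above.
AGREE = {"definitely agree", "slightly agree"}
DISAGREE = {"definitely disagree", "slightly disagree"}


def score_aq(responses, agree_keyed):
    """Return the total score (0 to the number of items).

    responses   -- list of response strings, one per question
    agree_keyed -- set of 0-based question indices for which agreeing is the
                   autistic-like answer (hypothetical keying, supplied by caller)
    """
    total = 0
    for index, answer in enumerate(responses):
        if answer not in AGREE and answer not in DISAGREE:
            raise ValueError(f"unexpected response: {answer!r}")
        agrees = answer in AGREE
        # One point whenever the answer goes in the autistic-like direction,
        # irrespective of "definitely" versus "slightly".
        if agrees == (index in agree_keyed):
            total += 1
    return total
```

With 50 items the total ranges from 0 to 50, matching the range reported above.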

Music Use (MUSE: Chin & Rickard, 2012) is a brief 
questionnaire containing questions aimed at measuring 
levels of music listening, training, and instrument play- 
ing, in addition to music engagement and experience. It 
contains eight open ended and 24 Likert scale questions. 
For the purposes of our study, Music Use scores were cal- 
culated for each participant as a summary score across 
the 24 Likert scale questions. Each Likert scale question
allowed for six choices (not applicable to me, strongly
disagree, disagree, neither agree nor disagree, agree,
strongly agree) corresponding to a 0 to 5 value. This
resulted in a total score for each participant ranging from
0 to 120 with higher scores reflecting greater engagement
and experience with music. The internal consistency of
the items was “excellent” (Cronbach’s α = 0.93; Cronbach,
1951).
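
The internal consistency values reported for the questionnaires and tasks are Cronbach's alpha coefficients. A minimal Python sketch of the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), is given below. The verbal labels in `describe_alpha` follow a common rule of thumb that happens to agree with the labels used in this section; the exact thresholds are an assumption, and both function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """scores: list of participants, each a list of k item scores."""
    k = len(scores[0])
    # Variance of each item across participants (columns of the score matrix).
    item_variances = [variance(item) for item in zip(*scores)]
    # Variance of each participant's summed score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def describe_alpha(alpha):
    """Map alpha to a verbal label (assumed rule-of-thumb thresholds)."""
    if alpha >= 0.9:
        return "excellent"
    if alpha >= 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    if alpha >= 0.6:
        return "questionable"
    if alpha >= 0.5:
        return "poor"
    return "unacceptable"
```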


Phonological short-term memory tasks 

Participants completed two phonological short-term 
memory tasks: digit span (e.g., Jacquemot & Scott, 2006) 
and 2-back (e.g., Kane et al., 2007). These two tasks 
involved briefly presenting a stimulus and asking the par- 
ticipant to recall it later. Participants were told to remem- 
ber the presented information as they would be asked 
about it later, but not to write anything down. 

The digit span task presented 10 increasingly long 
sequences containing one digit (first sequence) to 10 dig- 
its (tenth sequence). Participants were shown each digit 
for 2000 ms followed by a 100 ms fixation with consecu- 
tive digits repeating in that pattern. After each sequence 
was presented with all its digits, participants were asked 
to type the numbers in the correct order. The largest 
sequence with all its digits correctly recalled was calcu- 
lated for each participant (1-10), i.e., the “highest score” 
method, which typically yields higher reliability than 
a total score across all trials (Groth-Marnat & Baker, 
2003). A larger number represented a greater phonologi- 
cal short-term memory. The internal consistency of the 
items was “acceptable” (Cronbach’s α = 0.70; Cronbach,
1951). 
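
The “highest score” method can be sketched as follows. This is a hedged illustration, not the authors' code: the function name is hypothetical, and it assumes that a miss on a shorter sequence does not stop longer sequences from counting, which the description above does not state explicitly.

```python
def digit_span_score(presented, typed):
    """Length of the longest sequence recalled with every digit in order.

    presented -- list of digit strings shown to the participant
    typed     -- list of digit strings the participant entered
    """
    best = 0
    for target, answer in zip(presented, typed):
        # A sequence counts only if all digits match in the correct order.
        if answer == target:
            best = max(best, len(target))
    return best
```

A participant who recalls no sequence correctly would score 0 under this sketch, which falls outside the 1 to 10 range reported above.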

The 2-back task presented 32 English letters one at a 
time for 2000 ms each. After each letter was presented, 
participants were asked if that letter was presented two 
trials ago. There were nine targets and 23 incorrect foils. 
If the letter was presented two trials ago, participants 
were asked to press the “F” key; if the letter was not, 
participants were asked to press the “J” key. A 2000 ms 
response time limit was given for each letter and the next 
letter was displayed immediately after a button press. 
Because the first two trials were not scored, participants’ 
scores ranged from 0 to 30 and represented the correct 
total trials. A larger number represented a greater pho- 
nological short-term memory. The internal consistency 
of the items was “good” (Cronbach’s α = 0.89; Cronbach,
1951). 
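
A minimal sketch of the 2-back scoring is shown below, assuming that a timed-out trial (no key press within the response window) counts as incorrect; that treatment and the function name are assumptions, not details confirmed above.

```python
def two_back_score(letters, responses):
    """Count correct responses, skipping the first two unscored trials.

    letters   -- sequence of presented letters
    responses -- one entry per letter: True for "same as two trials ago"
                 (the "F" key), False for "different" (the "J" key),
                 or None when no response was given in time
    """
    correct = 0
    for i in range(2, len(letters)):
        is_target = letters[i] == letters[i - 2]
        if responses[i] is not None and responses[i] == is_target:
            correct += 1
    return correct
```

With 32 letters this yields a score between 0 and 30, as described above.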


Congruency tasks 

Participants completed two congruency tasks: flanker 
(Eriksen & Eriksen, 1974) and Simon (see Lu & Proctor, 
1995). These two tasks presented multiple stimulus–response
congruency trials. On each trial, information
was either congruent or incongruent and required a keyboard
press from the participant.

The flanker task showed five cartoon fish in a row 
with the middle fish either facing the same direction as 
the others (congruent) or facing the opposite direction 
(incongruent). Participants were asked to press “F” if the 
middle fish was swimming to the left; “J” if it was swim- 
ming to the right. Four practice trials with feedback were 
presented, followed by 36 trials without feedback. Of the 
36 trials, 18 were congruent (9 swimming left; 9 swim- 
ming right) and 18 were incongruent (9 swimming left 
while others swam right; 9 swimming right while others 
swam left). A 2000 ms response time limit was given for 
each trial. Each trial immediately advanced upon button 
press. The internal consistency of the items was “excellent”
(Cronbach’s α = 0.94; Cronbach, 1951). Response
time results were calculated by first removing incorrect 
trials (1%), and then calculating the Median Absolute 
Deviation using the psych package in R (Leys et al., 2013). 
Outliers were defined as response times falling outside the median
plus or minus three times the Median Absolute Deviation. Roughly 8% of the
data were removed as outliers. The remaining correct 
response times from congruent trials were subtracted 
from those of incongruent trials, resulting in a mean RT 
difference for each participant. A larger RT difference 
reflected worse abilities to ignore the incongruent infor- 
mation whereas a smaller RT difference reflected better 
abilities to ignore incongruent information and focus on 
the task. 
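
The response time pipeline (drop incorrect trials, trim outliers with the Median Absolute Deviation, then subtract congruent from incongruent means) can be sketched in Python as follows. The authors carried this out in R; here the scaling constant 1.4826 (the default of R's `mad` function), the choice to trim across all of a participant's correct trials at once, and the function names are assumptions.

```python
from statistics import mean, median

MAD_CONSTANT = 1.4826  # assumed scaling constant for normally distributed data


def mad(values):
    """Scaled median absolute deviation around the median."""
    center = median(values)
    return MAD_CONSTANT * median([abs(v - center) for v in values])


def congruency_effect(trials, threshold=3.0):
    """Mean incongruent RT minus mean congruent RT for one participant.

    trials -- list of (condition, correct, rt_ms) tuples, where condition is
              "congruent" or "incongruent"
    """
    # Step 1: keep correct trials only.
    correct = [(cond, rt) for cond, ok, rt in trials if ok]
    rts = [rt for _, rt in correct]
    # Step 2: drop response times beyond median +/- threshold * MAD.
    center, spread = median(rts), mad(rts)
    kept = [(cond, rt) for cond, rt in correct
            if abs(rt - center) <= threshold * spread]
    # Step 3: difference of condition means; larger values indicate a
    # greater cost of incongruent information.
    congruent = [rt for cond, rt in kept if cond == "congruent"]
    incongruent = [rt for cond, rt in kept if cond == "incongruent"]
    return mean(incongruent) - mean(congruent)
```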

The Simon task showed the words “left” and “right” on 
the two sides of the computer screen. Participants were 
asked to press the “F” key if the word “left” appeared on 
the screen and the “J” key if the word “right” appeared, 
irrespective of location. Four practice trials with feed- 
back were presented, followed by 36 trials without feed- 
back (18 congruent, i.e., “left” on the left-hand side and 
“right” on the right-hand side; 18 incongruent, i.e., “left” 
on the right-hand side and “right” on the left-hand side). 
The words “left” and “right” each appeared 18 times. A 
2000 ms response time limit was given for each trial. The 
internal consistency of the items was “poor” (Cronbach's 
α = 0.55; Cronbach, 1951). Response time results were
calculated by first removing incorrect trials (9%), and 
then calculating the Median Absolute Deviation using 
the psych package in R (Leys et al., 2013). Outliers were
defined as response times falling outside the median plus or minus
three times the Median Absolute Deviation. Roughly 5% of the data were
removed as outliers. As with the flanker task, remaining 
correct response times from congruent trials were sub- 
tracted from those of incongruent trials, resulting in a 
mean RT difference for each participant. A larger RT difference
reflected worse abilities to ignore the incongruent
information whereas a smaller RT difference reflected
better abilities to ignore incongruent information and
focus on the task.


Prosody task 

Thirty-two statement/question (e.g., “Class is cancelled”) 
and 32 happy/sad (e.g., “It’s time for class”) sentences 
were created. These sentences were designed to be con- 
cise, natural utterances someone might say, with easy-to- 
understand content. This resulted in 128 unique items (32 
statement + 32 question + 32 happy + 32 sad). Half of the 
sentences (16 per prosody type) were recorded unmasked 
and half with a cloth face mask worn securely on the face 
(black, filterless). Recording was done in a quiet room 
with a blank background using an iPad Air (4th Genera- 
tion), placed roughly 2 feet in front of the speaker. The 
speaker was a 20-year-old female university student who 
spoke American English. The videos were filmed in two 
sessions corresponding to the prosody: question/statement
and happy/sad. Each sentence was said twice in
succession, with a fixed order of question–statement or
happy–sad. The 128 videos were cut into individual files,
each approximately 2–4 s long and saved as an mp4.
The audio was recorded at 44.1 kHz using the internal 
microphone from the iPad. The audio was unaltered for 
the purposes of the study. Each item was labeled and
analyzed in Praat (version 6.1.08; Boersma & Weenink,
2021) with all normalized measurements obtained using
ProsodyPro (Xu, 2013). Figure 1 plots the normalized
F0 and amplitude measurements over three normalized
time points. The solid line represents a smoothed best
fit with gray 95% confidence intervals. This figure shows
that masked speech had a lower overall normalized mean
F0 and amplitude than unmasked speech.

Fig. 1 Normalized F0 (top) and amplitude (bottom) over three normalized time points. Solid lines represent a smoothed best fit with gray 95% confidence intervals

Sinagra and Wiener Cognitive Research: Principles and Implications

Statistically modeling prosody with or without a time
variable can be done using different approaches (see Xu
& Prom-on, 2015). We were interested in demonstrating
a difference between statement and question prosodies,
happy and sad prosodies, and speech produced with
and without a mask. We therefore calculated an overall
mean (across the three normalized time points) for each
of the 128 items. For F0, a two-way ANOVA confirmed
differences between masked and unmasked speech
[F(1, 120) = 6.27, p = 0.01, ηp² = 0.05] and prosody
[F(3, 120) = 29.19, p < 0.001, ηp² = 0.42], but no two-way
interaction (p = 0.95). The same pattern was found for
amplitude: mask [F(1, 120) = 9.76, p = 0.002, ηp² = 0.08],
prosody [F(3, 120) = 28.21, p < 0.001, ηp² = 0.41], and a null
two-way interaction (p = 0.68). Tukey-adjusted pairwise
comparisons confirmed that statement and question items, and
happy and sad items, differed from one another in mean F0 and
amplitude (ps < 0.05). Although masked speech lowered mean
F0 and amplitude overall, after correcting for multiple
comparisons no difference was found in any of the
eight individual comparisons (ps > 0.05). In other words,
the decreases in F0 and amplitude were aggregate effects
across all masked items.
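
The effect sizes reported with the ANOVAs above are consistent with partial eta squared, which can be recovered from each F value and its degrees of freedom as (F × df_effect) / (F × df_effect + df_error). The short Python check below is an illustrative sketch with a hypothetical function name, not part of the authors' R code.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F0: mask and prosody effects
f0_mask = round(partial_eta_squared(6.27, 1, 120), 2)
f0_prosody = round(partial_eta_squared(29.19, 3, 120), 2)
# Amplitude: mask and prosody effects
amp_mask = round(partial_eta_squared(9.76, 1, 120), 2)
amp_prosody = round(partial_eta_squared(28.21, 3, 120), 2)
```

The four rounded values reproduce the effect sizes reported in the text (0.05, 0.42, 0.08, and 0.41).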

An additional set of 128 videos were created by remov- 
ing the audio track with the program ffmpeg (Tomar, 
2006). This resulted in a total of 256 items (128 with
audio + 128 without audio). From these 256 items, two
half lists of 128 items each were created. This kept the
prosody task under 15 min, guarded against potential 
boredom, and meant that each participant heard a sen- 
tence in only one type of prosody, not both, and as either 
masked or unmasked, not both. For example, in half list 
1 the statement, “You burned it” was presented with 
audio and the question, “You burned it?” was presented 
without audio. In half list 2, the question was presented 
with audio and the statement without audio. These two 
half lists contained eight different prosody–mask–audio
blocks (2 × 2 × 2) with 16 items in each block
(8 × 16 = 128 items). The order of the eight blocks was
counterbalanced using a Latin-square design. 
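
One way to counterbalance eight blocks with a Latin square is the cyclic construction sketched below in Python. The specific square used in the study is not described, so the cyclic scheme, the block labels (the two prosody levels are taken here to be the statement/question and happy/sad tasks), and all function names are assumptions made for illustration.

```python
from itertools import product


def make_blocks():
    """The 2 x 2 x 2 prosody-mask-audio blocks (labels are assumed)."""
    prosody = ["statement/question", "happy/sad"]
    mask = ["unmasked", "masked"]
    audio = ["audio", "no audio"]
    return list(product(prosody, mask, audio))


def latin_square(conditions):
    """Cyclic Latin square: row r is the condition list rotated by r."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)]
            for row in range(n)]


def order_for_participant(participant_index, conditions):
    """Assign block orders to participants by cycling through the rows."""
    square = latin_square(conditions)
    return square[participant_index % len(conditions)]
```

In this construction every block appears exactly once in each serial position across the eight orders.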

Before beginning the 8 blocks, participants completed 
practice trials explaining the task and asking the listener 
to find a suitable volume. At the start of each block par- 
ticipants were told whether there would be audio or not, 
and to click as quickly and accurately as possible on the 
perceived statement/question or happy/sad intonation. 
In each block, a video was first presented, which could 
only be played once (Fig. 2, left). After mouse-clicking the 
play button, the video began (Fig. 2, center). Participants 
were then presented with the prosody choices (Fig. 2, 
right). Location of each prosody button was counterbal- 
anced across all trials. Participants who did not perform 
above chance (0.5) in either the happy/sad or statement/ 
question condition (N = 11) were removed from the data.


(2022) 7:89 




Procedure 

Participants completed the experiment in a fixed order: 
IRB information and consent, flanker task, Autism Spec- 
trum Quotient questionnaire, digit-span task, Music Use 
questionnaire, Simon task, prosody task, 2-back task. The 
entire procedure took approximately 30 min to complete. 
An attention check occurred after the third task and again after the sixth
task. In this attention check, participants
were required to click on a button five times before pro- 
gressing to the next task. Participants who failed to click 
on the button five times within a minute of presentation 
(at either check) were removed (N=5). 


Data analysis 

All analyses were carried out in R (version 4.1.0; R Core 
Team, 2020) with a 0.05 alpha level. The accuracy of the 
prosody task (coded as 1 correct; 0 incorrect) was mod- 
eled using generalized linear mixed-effects models with 
a logit link function using the lme4 package (version
1.1.29). Two separate models were run corresponding 
to the prosody: statement/question and happy/sad. The 
fixed effects included the three experimental manipu- 
lations of mask, audio, and prosody, all of which were 
dummy coded with the reference levels as: “unmasked,”
“audio,” and “question prosody” or “happy prosody.” This
meant the inclusion of a mask in the “masked” condition 
should reduce the log-odds of correct identification (as 
reflected by a negative coefficient), and the removal of 
audio in the “no audio” condition should reduce the log- 
odds of correct identification (as reflected by a negative 
coefficient). Any difference between prosodies relative to 
the “question” or “happy” prosody will be reflected by a 
positive (i.e., accuracy increase) or negative (i.e., accuracy 
decrease) coefficient. 

Fig. 2 Prosody task sequence from left to right: mouse-click play button, video played only once, prosody options

The scores from the six individual differences tasks
were first standardized and then included as fixed effects
in the models. Positive coefficients reflect an increase
in log-odds of correct identification (given a one-unit
increase for the variable) whereas negative coefficients
reflect a decrease in log-odds of correct identification
(given a one-unit increase for the variable). For example,
a positive coefficient for Autism Spectrum Quotient will 
indicate that more autistic-like behavior resulted in an 
increase in log-odds of correct identification (i.e., accu- 
racy increase) whereas a negative coefficient will indicate 
that more autistic-like behavior resulted in a decrease 
in log-odds of correct identification (i.e., accuracy
decrease). For Music Use, digit span, and 2-back tasks, we 
expect a positive coefficient as better music abilities and 
better phonological short-term memory should lead to 
more accurate identification. For Autism Spectrum Quo- 
tient, we expect a negative coefficient as more autistic- 
like behavior should lead to less accurate identification. 
For flanker and Simon tasks, we also expect a negative 
coefficient as a larger difference between congruent and 
incongruent trials (i.e., greater incongruity costs) should 
lead to poorer identification accuracy. 
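
Standardizing puts the six measures, which are recorded on very different scales (for example, a 0 to 50 questionnaire total versus a response time difference in milliseconds), onto a common scale, so that a “one-unit increase” means one standard deviation. A minimal Python sketch, assuming z-scores computed with the sample standard deviation (an assumption; the text does not name the exact procedure):

```python
from statistics import mean, stdev


def standardize(scores):
    """Return z-scores: (score - mean) / sample standard deviation."""
    center = mean(scores)
    spread = stdev(scores)
    return [(score - center) / spread for score in scores]
```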

For each analysis, the maximal model was first fit. The 
maximal model contained fixed effects of mask, audio, 
prosody, and all six individual differences measures. Both 
two-way and three-way interactions were included in 
the model. Because mask, audio, and prosody were not 
manipulated within-items, and because any individual 
difference effect should be constant by participant (see 
Barr et al., 2013 and Brown, 2021 for discussions), the 
model contained by-subject random slopes for
mask, audio, and prosody and by-item random slopes for all
six individual difference tasks. If this model did not converge
or produced a singular fit, random slopes that contributed
the least amount of variance were removed until
the model converged without a singular fit. Model quality
(conditional R² and Bayesian Information Criterion) was
assessed using the performance package (Lüdecke et al.,
2021). The final statement/question model:

(accuracy ~ mask * autism quotient + autism quotient * prosody + audio * autism quotient
+ simon * mask + simon * prosody + audio * simon
+ 2-back * mask + 2-back * prosody + audio * 2-back
+ digit span * mask + digit span * prosody + audio * digit span
+ flanker * mask + flanker * prosody + audio * flanker
+ music use * mask + music use * prosody + audio * music use
+ mask * audio * prosody
+ (autism quotient | item) + (mask + prosody + audio | participant)).

The final happy/sad model:

(accuracy ~ mask * autism quotient + autism quotient * prosody + audio * autism quotient
+ simon * mask + simon * prosody + audio * simon
+ 2-back * mask + 2-back * prosody + audio * 2-back
+ digit span * mask + digit span * prosody + audio * digit span
+ flanker * mask + flanker * prosody + audio * flanker
+ music use * mask + music use * prosody + audio * music use
+ mask * audio * prosody
+ (2-back | item) + (mask + prosody + audio | participant)).
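
The two final models share the same fixed-effects structure: each of the six individual difference measures interacts with each of the three manipulations, plus the three-way interaction among the manipulations. They differ only in the by-item random slope. The Python sketch below assembles that structure as a formula string to make the pattern explicit. The underscore variable names and the function name are hypothetical, the term order within each interaction is not meaningful, and the string is illustrative rather than the authors' actual lme4 call.

```python
MEASURES = ["autism_quotient", "simon", "two_back",
            "digit_span", "flanker", "music_use"]
MANIPULATIONS = ["mask", "prosody", "audio"]


def build_formula(item_slope):
    """Assemble the model formula; item_slope names the by-item random slope."""
    # Each measure crossed with each manipulation: 6 x 3 = 18 interactions.
    two_way = [f"{measure} * {manipulation}"
               for measure in MEASURES
               for manipulation in MANIPULATIONS]
    fixed = two_way + ["mask * audio * prosody"]
    random = [f"({item_slope} | item)",
              "(mask + prosody + audio | participant)"]
    return "accuracy ~ " + " + ".join(fixed + random)


statement_question = build_formula("autism_quotient")
happy_sad = build_formula("two_back")
```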


Fig. 3 Standardized individual predictors (left) and performance on eight prosody–mask–audio conditions in the prosody task (right)

Fig. 4 Mixed-effects logistic regression estimates for statement/question prosody model (left) and happy/sad prosody model (right). The plot shows the log-odds estimate along with 95% confidence intervals. Darker shade indicates a positive estimate; lighter shade indicates a negative estimate

Results
Figure 3 (left) plots the standardized results from the six 
individual differences tasks, all of which showed slightly 
different distributions. After correcting for multiple com- 
parisons, none of the performances on the six variables 
were correlated with one another (ps>0.05). Figure 3 
(right) plots accuracy in the prosody task. Each point represents
one participant’s mean in each of the eight prosody–mask–audio
conditions. Condition means are plotted as
the large points. For the majority of participants, the task 
was relatively easy, but for some, this was a difficult task. 
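
To give a sense of what the multiple-comparison correction involves, the sketch below counts the pairwise correlations among six measures and derives a per-test threshold. The text does not say which correction was applied; Bonferroni is used here purely as an assumption for illustration, and the helper name is hypothetical.

```python
from itertools import combinations

# The six individual difference measures named in the text.
MEASURES = ["autism quotient", "music use", "digit span", "2-back", "flanker", "simon"]
ALPHA = 0.05

pairs = list(combinations(MEASURES, 2))  # every unordered pair of measures
# ASSUMPTION: Bonferroni correction; the source does not name the method used.
bonferroni_alpha = ALPHA / len(pairs)


def survives_bonferroni(p_value, n_tests=len(pairs), alpha=ALPHA):
    """True if an uncorrected p-value is still significant after correction."""
    return p_value < alpha / n_tests


print(len(pairs), bonferroni_alpha)
```

With six measures there are 15 pairwise tests, so under this assumption an uncorrected p-value would need to fall below 0.05 / 15 to count as significant.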
Figure 4 plots the two models' log-odds estimates and 95% confidence
intervals. For each estimate, the darker shade
indicates a positive estimate whereas the lighter shade
indicates a negative estimate. The statement/question
model (conditional R² = 0.53; BIC = 4441.2) revealed that
masked speech was harder to identify than unmasked
speech (β = -1.04, SE = 0.39, Z = -2.65, p = 0.008),
and videos with no audio were harder to identify than
videos with audio (β = -1.32, SE = 0.39, Z = -3.39,
p < 0.001). One individual difference predictor was
found to be significant: Autism Spectrum Quotient
(β = -0.52, SE = 0.20, Z = -2.58, p = 0.01). All other
predictors and interactions were not significant (ps > 0.05).


The happy/sad model (conditional R² = 0.53;
BIC = 2156.5) revealed that masked speech was harder to
identify than unmasked speech (β = -1.44, SE = 0.51,
Z = -2.79, p = 0.005). All other predictors and interactions
were not significant (ps > 0.05).
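
The reported Z and p values can be checked against the estimates and standard errors. The sketch below assumes the Z values are Wald statistics (estimate divided by standard error) with two-sided p-values from the standard normal distribution; the text does not state this, but the reported numbers are consistent with it. The function name is hypothetical.

```python
import math


def wald_test(estimate, std_error):
    """Return (z, two-sided p) for a Wald test of a single coefficient."""
    z = estimate / std_error
    # Two-sided p from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# (estimate, SE) pairs as printed in the text, rounded to two decimals.
reported = {
    "statement/question: mask": (-1.04, 0.39),
    "statement/question: no audio": (-1.32, 0.39),
    "statement/question: autism quotient": (-0.52, 0.20),
    "happy/sad: mask": (-1.44, 0.51),
}

for name, (b, se) in reported.items():
    z, p = wald_test(b, se)
    print(f"{name}: z = {z:.2f}, p = {p:.3f}")
```

Because the printed estimates and standard errors are rounded, the recomputed Z values can differ from the reported ones in the second decimal place.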

In sum, for both models, masked speech significantly 
reduced the log-odds of correct identification. Autism 
Spectrum Quotient also reduced the log-odds of correct 
identification; however, this reduction was only statisti- 
cally significant in the statement/question model. Neither 
model revealed a significant effect of prosody. The lack of 
audio resulted in a reduced log-odds of correct identifica- 
tion only in the statement/question model. All other pre- 
dictors (and interactions) neither significantly increased 
nor decreased the log-odds of correct identification at an 
alpha level of 0.05. 
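
Log-odds estimates are easier to interpret once converted to odds ratios or probabilities. The sketch below does this for the two reported mask coefficients. The 0.90 baseline accuracy is a made-up value for illustration only, and because both models include interactions, each coefficient applies at the reference levels of the other predictors. The function names are hypothetical.

```python
import math


def odds_ratio(log_odds):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(log_odds)


def shifted_probability(baseline_probability, log_odds):
    """Probability after adding `log_odds` to the baseline on the logit scale."""
    baseline_logit = math.log(baseline_probability / (1 - baseline_probability))
    return 1 / (1 + math.exp(-(baseline_logit + log_odds)))


# Mask coefficients reported in the text.
# ASSUMPTION: the 0.90 baseline accuracy is illustrative, not taken from the source.
for label, beta in [("statement/question", -1.04), ("happy/sad", -1.44)]:
    print(label, round(odds_ratio(beta), 2), round(shifted_probability(0.90, beta), 2))
```

For example, the statement/question mask estimate of -1.04 corresponds to an odds ratio of roughly 0.35, meaning the odds of a correct response for masked items were about a third of those for unmasked items.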


Discussion 

This exploratory study set out to examine how face masks 
affect listeners’ perception of statement/question into- 
national prosody and happy/sad emotional prosody. We 
were specifically interested in six individual differences 
across participants and tested whether they predicted 
performance on the prosody task. We present three find- 
ings from our study. 
First, we found that masks affected the discrimination
of both statement/question and happy/sad prosody.
Items produced with a mask were harder for participants 
to correctly discriminate than those produced without a 
mask. The acoustic and facial cues typically relied on for 
prosody discrimination were reduced as a result of the 
face mask, which in turn reduced listeners’ accuracy. This 
extends recent findings, which showed that masks do not 
necessarily affect individual word recognition accuracy 
in speech presented without background noise (Magee 
et al., 2020; Smiljanic et al., 2021; though see Brown et al., 
2021; Toscano & Toscano, 2021 for speech presented 
in noise). Our results, however, indicate that masks can 
affect prosody discrimination of speech presented with- 
out background noise. These results also underscore 
how lexical access alone is not sufficient for understand- 
ing emotional and intonational prosody (Cutler, 2012). 
Importantly, as Fig. 3 (right) shows, we found a consid- 
erable range of behavior in our prosody discrimination 
task, which supports the observation that daily commu- 
nication with masks is more challenging for some listen- 
ers than it is for others. We found this to be particularly 
true for perception of prosody when the speaker was 
masked. 

Second, we found that the lack of audio affected only
discrimination of statement/question prosody. This
pattern most likely arose because our speaker conveyed
more facial cues while producing happy/sad prosody
than statement/question prosody. Because the stimuli
were recorded outside of a lab, we did not control facial 
cues as tightly as we would have preferred. In a post hoc 
exploratory analysis, we had 10 new participants rate the 
128 videos for their facial movement (1 = no movement;
5 = full body/face movement). As expected, the statement/question
videos were rated as having, on average,
less facial movement than the happy/sad videos. These
preliminary results suggest that facial cues were more 
helpful in determining the happy/sad contrast than they 
were in determining the statement/question contrast. 
Moreover, with enough facial cues, audio may not be a 
necessary condition for correct emotional prosody iden- 
tification (e.g., Lansing & McConkie, 1999; Munhall 
et al., 2004). Indeed, while some participants struggled in 
our task, many of our participants did not make a mis- 
take in the prosody task. For now, we are unable to say 
whether this difference in facial cues or something inher- 
ent to statement/question prosody caused the results and 
therefore refrain from further speculation. 

Third, we observed high participant variability in our 
results, in line with previous studies on individual differences
in prosody perception (e.g., Baumann & Winter,
2018; Roy et al., 2017). We found a general trend
whereby participants with higher Autism Spectrum Quotient
scores (i.e., more autistic-like behavior) struggled to identify
prosody correctly. This was a statistically significant
effect for statement/question items but a nonsignificant 
effect for happy/sad items. These Autism Spectrum Quo- 
tient results, in part, support previous research on autism 
and prosody (e.g., McCann & Peppé, 2003; Philip et al., 
2010; Paul et al., 2005), which has shown that individuals 
with autism spectrum disorder typically struggle to pro- 
cess emotional information quickly, including verbal and 
nonverbal emotional cues (e.g., Clark et al., 2008; Eack 
et al., 2015; Peppé et al., 2011). Autistic listeners also 
often struggle to perceive subtle acoustic differences such 
as F0 rise and fall, i.e., a primary cue in statements and
questions (Järvinen-Pasley et al., 2008; though see Wang
et al., 2021 for conflicting results). We also note that we 
did not find an interaction between Autism Spectrum 
Quotient and masks, which suggests that masks did not 
disproportionally affect those listeners with more autistic 
traits. 

With regard to the other individual difference predictors,
we found that music experience and use had
no significant effect on performance in the prosody task. This is a
somewhat unexpected finding, as previous research has
indicated that musical training and musical experience
tend to result in an overall improvement in prosody perception,
particularly for emotional and intonational prosody
(e.g., Hausen et al., 2013; Lima & Castro, 2011; Thompson
et al., 2004). This null effect may have been driven,
in part, by the relatively high accuracy participants 
reached in the task. Therefore, any musical advantage 
may not have been needed given how easy the task was 
for participants. 

We found that congruence task behavior (flanker, 
Simon) did not predict prosody task behavior. These 
results go against our initial prediction that performance 
on congruency tasks may predict masked prosody dis- 
crimination given that masks can be distracting to the lis- 
tener (e.g., Carbon, 2020). This null effect may have been 
due to the relatively high accuracy participants reached 
in the prosody task and/or due to flanker and Simon tasks 
being more useful measures for studies on bilingualism 
and language switching rather than prosody perception 
(see Paap et al., 2017 for discussion). 

We found no effects of phonological short-term mem- 
ory (2-back, digit span) in either of our models. Because 
behavior on none of the tasks was correlated, it is pos- 
sible that the tasks were measuring potentially different 
facets of the construct typically referred to as “working 
memory capacity” (see Conway et al., 2005). One likely 
explanation for the observed pattern of phonological 
short-term memory results is that neither task suffi- 
ciently involved linguistic awareness. The digit span task 
involved numeric awareness and the 2-back task did not 
sufficiently involve linguistic awareness given that participants
only had to attend to a letter rather than a string
of letters or a word (Jacquemot & Scott, 2006; Jaeggi et al.,
2010). Research has also called into question the use of 
the digit span task as a measure of attention or memory 
(Groth-Marnat & Baker, 2003). 

Limitations to this study include the following: First, we 
did not record each prosody sentence with and without 
a mask. Presenting each sentence in both masked and 
unmasked conditions (across participants) would have 
been a preferable design choice, given that the sentences 
assigned to each mask condition may have differed in 
the strength of their prosodic cues, the extent to which 
they can be lip-read, etc. Second, the recorded happy/sad 
prosody was a simulated or portrayed prosody. In other 
words, our speaker was not actually happy or sad when 
producing the sentences. These emotional portrayals 
were based on stereotypical vocal expressions rather than 
psychophysiological effects on the voice (see Scherer, 
2003). We note, however, that portrayed prosodies are 
typically recognized by listeners as the intended emotion 
and all emotions are, to some extent, “portrayals” given 
the sociocultural norms of speech and emotion (see 
Banse & Scherer, 1996 for additional discussion). Third, 
the recordings were made in a fixed order, which intro- 
duced a potential confound between recording order and 
prosody condition. For example, statements were always 
recorded before questions. It is unclear to what degree 
(if any) this recording procedure affected the results. We 
acknowledge that a better design would have been to 
counterbalance the order of the prosody condition such 
that half the statements were recorded first and half were 
recorded second. Fourth, because we did not alter the
audio, it is unclear whether acoustic differences, on their
own, were enough to drive the differences in performance 
between the mask conditions. For example, it remains an 
open question whether accuracy would remain the same 
had the stimuli produced without a mask been altered to 
lower the F0 and amplitude to match the acoustic characteristics
of the speech produced with a mask. We note
that this would not change the fact that the masks still 
affect accuracy, but it would help us better pinpoint the 
reason for this accuracy decrease. 
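
The counterbalancing described in the third limitation could be set up along the following lines. This is a minimal sketch of one possible assignment scheme, not the procedure used in this study; the function name, the seed, and the number of sentence pairs are all hypothetical.

```python
import random


def counterbalance(sentence_ids, seed=0):
    """Assign half the sentence pairs to each recording order."""
    ids = list(sentence_ids)
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    order = {}
    for i, sid in enumerate(ids):
        # First half: statement recorded first; second half: question recorded first.
        order[sid] = ("statement", "question") if i < half else ("question", "statement")
    return order


# ASSUMPTION: 16 sentence pairs is an illustrative count, not taken from the source.
plan = counterbalance(range(1, 17))
for sid in sorted(plan):
    print(sid, plan[sid])
```

Shuffling before splitting keeps recording order from lining up with any systematic property of the sentence list, and the same scheme would apply to the happy/sad pairs.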

To conclude, our results indicate that higher-level 
understanding of intonations and emotions can be hin- 
dered by face masks. Speakers should keep in mind the 
listener and consider increasing pitch and volume in cer- 
tain communicative contexts, if possible, when speaking 
with a mask. This may improve understanding. Increased 
pitch and volume may also be helpful when facial cues 
are obscured by the mask. Finally, speakers will come in 
contact with a wide variety of listeners. It is important to 
remember that not all listeners can detect subtle facial 
cues associated with emotions, even without masks.
Speech communication may benefit from explicit statements
of emotion. At the very least, if the speaker explicitly states
an emotion like “I am happy,” the listener will not wonder
whether the speaker is smiling or frowning behind the 
mask. 